Initializing PySpark for Hadoop Distribution

If you have configured a Hadoop Cluster (EMR/CDH) in the Chorus Data section, you can use PySpark to read and write HDFS data. This notebook can be exposed as a Python Execute operator which can be used by the downstream operators in a workflow.

This method is far more efficient than other methods for medium or large data sets, and it is the only viable option for data sets larger than a few GB.

You can initialize and use PySpark in your Jupyter Notebooks for Team Studio.

Start in the Notebooks environment in TIBCO Data Science - Team Studio.

Before you begin
This prerequisite applies only if you created a notebook before version 6.5.0 of TIBCO Data Science - Team Studio using the Initialize Pyspark for Cluster function. The update is required to accommodate Spark upgrades in the system.
  1. Regenerate the PySpark context by clicking Data > Initialize Pyspark for Cluster.

  2. Change the previously-generated code to the following:
    os.environ['PYSPARK_SUBMIT_ARGS'] = (
        "--master yarn-client --num-executors 1 --executor-memory 1g "
        "--packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.11:3.0.1 "
        "pyspark-shell")

If you do not have access to a Hadoop cluster, you can run your PySpark job in local mode. Before running PySpark in local mode, set the following configuration.

  1. Set the PYSPARK_SUBMIT_ARGS environment variable as follows:
    os.environ['PYSPARK_SUBMIT_ARGS']= '--master local pyspark-shell'
  2. Set the YARN_CONF_DIR environment variable as follows:
    os.environ['YARN_CONF_DIR'] = ''
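The two local-mode settings above can be combined into a single setup cell. This is a minimal sketch using only the Python standard library:

```python
import os

# Run Spark locally instead of on a YARN cluster.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local pyspark-shell'

# Clear the YARN configuration directory so PySpark does not try to
# contact a cluster resource manager.
os.environ['YARN_CONF_DIR'] = ''
```

Run this cell before the generated PySpark initialization code so that the local settings take effect when the Spark context is created.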

Procedure
  1. Create a new notebook.
  2. Click Data, and then select Initialize PySpark for Hadoop Distribution.

    Data tab - Initialize PySpark for Hadoop Distribution options

  3. A SELECT DATA SOURCE TO CONNECT dialog appears. Select an existing Hadoop data source, and then click Add Data Source.

  4. A bit of code is inserted into your notebook; it facilitates communication between the data source and your notebook. To read more data sources, repeat step 1 through step 3 (a Python Execute operator accepts a maximum of three inputs). To run this code, press Shift+Enter or click Run.

    Now, you can run other commands by referring to the comments in the inserted code.

    The commands use the object cc, an instance of the ChorusCommander class, with the parameters required for the methods to work correctly. The generated code sets the sqlContext argument of the cc.read_input_file method call to the initialized Spark session. You can set the spark_options dictionary argument to pass additional options to the Spark Data Frame Reader for the CSV format.

  5. To read the data sets, uncomment the lines in the generated code for each data set, along with the corresponding _props variable that holds its spark_options.

  6. To use the notebook as a Python Execute operator, change the use_input_substitution parameter from False to True and add the execution_label parameter for each data set to be read. The execution_label values start at the string '1', followed by '2' and '3' for subsequent data sets. For more information, see help(cc.read_input_file).

  7. The generated cc.read_input_file method call returns a Spark Data Frame. You can modify, copy, or perform any other operations on the Data Frame as required.

  8. After creating the required output Spark Data Frame, write it to the target using the cc.write_output_file method.

    • To enable the use of output in downstream operators, set use_output_substitution=True.

    • To overwrite any existing files in the same path as the target, set the overwrite_exists parameter to True.

    • You can set the spark_options dictionary argument to pass additional options to the Spark Data Frame Writer for the CSV format.

    For more information, see help(cc.write_output_file).

    Note:
    • If this is not a terminal operator in the workflow, then do not set header=True because subsequent operators use operator metadata and not the header line in the output file.

    • Use a comma (,) as the delimiter argument in write_output_file for compatibility with other operators in a workflow, or omit the delimiter argument.

  9. Run the notebook manually so that the metadata can be determined and the notebook can be used as a Python Execute operator in legacy workflows.
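The spark_options dictionaries mentioned in the steps above are plain Python dictionaries of options passed through to the Spark CSV reader or writer. The following sketch shows typical values; the option names are standard Spark CSV options, and whether a given option applies depends on your Spark version:

```python
# Options passed to the Spark Data Frame Reader for CSV input.
read_props = {
    'header': 'true',       # first line of the file is a header row
    'inferSchema': 'true',  # let Spark infer column types
    'delimiter': ',',       # comma delimiter for workflow compatibility
}

# Options passed to the Spark Data Frame Writer for CSV output.
# header is deliberately left unset when downstream operators consume
# this output, because they rely on operator metadata, not a header line.
write_props = {
    'delimiter': ',',
}
```

Pass such a dictionary as spark_options=read_props in cc.read_input_file, or as spark_options=write_props in cc.write_output_file.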

Sample commands

Use the following sample command to write a table:

cc.write_output_table(ds1_adult_csv, table_name='adult_outemr2_hadoop.csv',
    schema_name='Compute_s3_local', database_name='',
    sqlContext=spark, spark_options=ds1_props,
    overwrite_exists=True, drop_if_exists=True, use_output_substitution=True)

Use the following sample command to write a file:

cc.datasource_name = 'EMR535_Large3'
cc.write_output_file(ds1_adult_csv, file_path='/tmp/adult_out.csv',
    sqlContext=spark, file_type='csv', spark_options={},
    overwrite_exists=True)